Covid-19 Trends on the Global Level
Exploring worldwide Covid Data!
Tabitha
Lynca
Emma - Covid Recovery Times
For this PUG project we chose to keep working with COVID data, this time
on a global scale rather than just statewide data for Massachusetts. My
goal was to find out which qualities and conditions were shared by
countries that took longer to “recover” from COVID in 2021. To do this I
used k-means clustering on a handful of attributes from the OWID COVID
dataset. I specifically wanted attributes that did not technically “have
anything to do with” COVID or a pandemic, so I could see how each
country’s pre-pandemic condition related to its response. I looked for
poverty and income data in other sources, but it was either incomplete
or not from the years I wanted, so I stayed within the OWID dataset. I
narrowed my variables down to population density, GDP per capita,
cardiovascular death rate, and diabetes prevalence for each country, as
they were both the most complete and the most interesting to me. My next
big step was to define my daystil variable, which would tell me how
long it took, in number of days, for a country to recover. After
exploring the distribution displays on the OWID website, I decided to
count the number of days each country took to get to one death per
million, with the count including all of 2020 but recovery only
evaluated in 2021. I did not evaluate recovery in 2020 for a few
reasons: things started out bad and got worse, and I didn’t want that
“started out” portion to be picked as the point when a country was at
one death per million; the summer allowed for much more time outside and
less school, with lowered rates, so I wanted to evaluate recovery from a
winter when people would have been inside; and countries that were
affected later needed to be taken into consideration. So I evaluated
only 2021 and forward when finding the point at which each country hit
the 1 death per million mark.
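To make the daystil definition concrete, here is a minimal Python sketch of the counting rule described above. It is not the code used for this post. The function name, the toy series, the `START` date, and the choice to treat "at or below 1" as recovered are all assumptions made for illustration; the post does not state the exact day counting began.

```python
from datetime import date

def days_until_recovery(series, start, eval_from, threshold=1.0):
    """Days from `start` to the first date on or after `eval_from`
    where deaths per million is at or below `threshold`.

    `series` is a list of (date, deaths_per_million) pairs.
    Returns None if the threshold is never reached.
    """
    for day, deaths_per_million in sorted(series):
        if day >= eval_from and deaths_per_million <= threshold:
            return (day - start).days
    return None

# Toy series for one hypothetical country (values are made up).
toy = [
    (date(2020, 6, 1), 0.4),  # below the threshold, but in 2020: ignored
    (date(2021, 1, 1), 3.0),
    (date(2021, 1, 2), 1.5),
    (date(2021, 1, 3), 0.9),  # first 2021 day at or below 1 per million
]

# Placeholder start date (assumption, not taken from the post).
START = date(2020, 1, 1)

result = days_until_recovery(toy, START, eval_from=date(2021, 1, 1))
print(result)
```

The low value in June 2020 is skipped because evaluation only begins in 2021, while the day count itself still runs from the start date, so all of 2020 is included in the total.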
I used an elbow plot (Figure 1) to pick the number of clusters and ended up with three. Four would also have been fine, but I didn’t think the added complexity would contribute much to what I was doing, so I went ahead with three.
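For readers new to elbow plots, the idea can be sketched in Python as follows. This is a plain Lloyd-style k-means written from scratch with a deterministic farthest-point start, not the routine that produced Figure 1. The toy points and all function names are made up for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_point_init(points, k):
    """Deterministic start: the first point, then repeatedly the point
    farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def total_wss(points, k, iters=50):
    """Total within-cluster sum of squares after Lloyd-style k-means."""
    centers = farthest_point_init(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # New center = mean of its group; an empty group keeps its old center.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    # Squared distance from each point to its nearest final center.
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Toy data: three tight, well-separated groups of four points each.
toy = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (20, 0), (20, 1), (21, 0), (21, 1)]

wss_by_k = {k: total_wss(toy, k) for k in range(1, 6)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 2))
```

On this toy data the total drops sharply up to three clusters and barely changes afterwards. That bend is the "elbow", and it is the same kind of judgment call as choosing three over four above.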
Figure 1
My three clusters were clearly different in terms of GDP per capita, cardiovascular death rate, and population density, but their average days until recovery were fairly close: about 312, 316, and 360 days (Figure 2). I also noticed that Russia was a huge outlier in days until recovery, so I tried clustering without Russia. I made the elbow plot again for this version and again decided on three clusters. The within-cluster sum of squares (the “Withins” column) was slightly lower for one group without Russia than with Russia, but removing it seemed to make more of a difference to the visualization than to the actual statistical results, so I kept Russia in the data (Figure 3). I found the clusters easiest to see, and most interesting, when plotting GDP per capita against days until recovery.
Figure 2
| Cluster | Cluster Size | Days until Recovery | GDP per Capita | Population Density | Cardiovascular Death Rate | Diabetes Prevalence | Withins |
|---|---|---|---|---|---|---|---|
| 1 | 9 | 312.1111 | 76769.75 | 1010.9996 | 154.3424 | 10.222222 | 2898206069 |
| 2 | 126 | 316.4524 | 7742.67 | 143.2296 | 303.1972 | 7.896111 | 3995840866 |
| 3 | 49 | 359.6122 | 34404.29 | 220.6610 | 187.6149 | 8.176327 | 3997169441 |
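As a quick arithmetic check on Figure 2, the cluster sizes and mean days until recovery can be combined into an overall count and a size-weighted average. This small Python sketch only re-uses the numbers from the table; the variable names are my own.

```python
# (cluster, size, mean days until recovery), copied from Figure 2.
figure2 = [(1, 9, 312.1111), (2, 126, 316.4524), (3, 49, 359.6122)]

# Total number of countries across the three clusters.
total_countries = sum(size for _, size, _ in figure2)

# Size-weighted mean of days until recovery over all countries.
overall_mean = sum(size * mean for _, size, mean in figure2) / total_countries

# Gap between the slowest and fastest cluster averages.
means = [mean for _, _, mean in figure2]
spread = max(means) - min(means)

print(total_countries, round(overall_mean, 1), round(spread, 1))
```

The total of 184 countries agrees with the regression output in Figure 4 (181 residual degrees of freedom plus 3 estimated coefficients), and the gap of about 47.5 days between cluster averages is smaller than that model’s residual standard error of 62.57.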
Figure 3
Finally, I removed all countries that “recovered” on day one of 2021, which corresponds to a days until recovery value of 297. Many countries were already under one death per million at that point, and I wondered if having so many of them was skewing my data. I hesitate to make this my entire project, because there are certainly countries that reached this level and managed to stay in control all on their own, but I think a large number of countries have this value because their reporting was harder to get a hold of. Results were very similar: I didn’t see large differences between groups in the number of days it took countries to recover, and the trends with GDP, population density, and my other variables were consistent.
As one last bonus step, I fit a multiple linear model to try to predict days until recovery, picking from the same set of variables I used for clustering and ending up with just GDP per capita and population density (Figure 4). It’s not as nuanced as clustering, but it was interesting to compare its more “basic” observations about days until recovery (higher GDP per capita and lower population density are associated with more days until recovery, though the population density coefficient is only marginally significant at p ≈ 0.067 and the model explains under 5% of the variance) with what my clusters seemed to communicate. Looking at the clusters (three clusters, including Russia), I found that the group with the highest GDP per capita and highest population density, as well as the group with the lowest GDP per capita and lowest population density, had the lower days until recovery, whereas the middle ground in those two respects had higher days until recovery. Diabetes prevalence and cardiovascular death rate did not seem to follow any specific pattern of higher or lower values grouping with days until recovery (Figure 2).
Figure 4
##
## Call:
## lm(formula = daystil ~ gdp_per_capita + population_density, data = covidCountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100.76 -26.58 -21.54 -5.80 446.81
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.177e+02 6.353e+00 50.003 < 2e-16 ***
## gdp_per_capita 7.124e-04 2.500e-04 2.849 0.00489 **
## population_density -1.416e-02 7.676e-03 -1.845 0.06667 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.57 on 181 degrees of freedom
## Multiple R-squared: 0.04864, Adjusted R-squared: 0.03813
## F-statistic: 4.627 on 2 and 181 DF, p-value: 0.01097
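To see what the fitted line in Figure 4 implies, the Python sketch below plugs the average GDP per capita and population density of each cluster from Figure 2 into the reported coefficients. It is a reading aid that re-uses printed numbers, not a refit of the model; the function and variable names are my own.

```python
# Coefficients copied from the Figure 4 summary.
INTERCEPT = 317.7
B_GDP = 7.124e-4
B_DENSITY = -1.416e-2

def predict_days(gdp_per_capita, population_density):
    """Days until recovery predicted by the linear model in Figure 4."""
    return INTERCEPT + B_GDP * gdp_per_capita + B_DENSITY * population_density

# cluster: (mean GDP per capita, mean population density, mean days), from Figure 2.
clusters = {
    1: (76769.75, 1010.9996, 312.1111),
    2: (7742.67, 143.2296, 316.4524),
    3: (34404.29, 220.6610, 359.6122),
}

predictions = {c: predict_days(gdp, dens) for c, (gdp, dens, _) in clusters.items()}
for c, (_, _, observed) in clusters.items():
    print(c, round(predictions[c], 1), round(observed, 1))
```

At the cluster averages the line predicts roughly 358, 321, and 339 days, against observed averages of about 312, 316, and 360. The richest, densest cluster recovers faster than the line suggests and the middle cluster slower, which matches the non-linear pattern the clusters show.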
I had high hopes for what my data could and would show, but I think there are some serious limitations to this project. My biggest challenge was figuring out how to define my days until recovery variable. The threshold I picked, 1 death per million, seems like a good metric to me, but I think a better one would have taken the population density of the country into account. Even though it is a per capita number, which accounts for the actual size of the population, I wonder if population density might guide me to different and more specific metrics. The 2020/2021 choice for when to start counting days was also a little shaky, but I think the reasoning behind it is much more solid than the reasoning behind 1 death per million. Another thing I wish I could have implemented better is the poverty statistics, which I had hoped to cover for all countries (instead of just SIX when I narrowed it down :( ), and something on healthcare. I was unable to find healthcare data for the countries I wanted to include either, but I think it would have been really interesting to classify levels of healthcare coverage and then use that to group countries along with days until recovery.
Ultimately, I think this is a really interesting concept. If I had better and more complete data on a wider range of country attributes, like housing type, rural vs. urban populations, and healthcare coverage, I would have more to choose from and probably more interesting results. Were that data to exist, it could help the WHO advise countries on what to expect from a pandemic based on their current situation, and on how they might put themselves in a better position to resist one.